feat: NVIDIA Multi-GPU Detection, Topology-Aware Assignment & Parallelism #501
Conversation
Lightheartdevs left a comment
Review: Needs Work
Strong algorithm and good test coverage (561 lines of pytest), but a few issues need resolving before merge:
1. jq promoted from optional to required (breaking)
01-preflight.sh now hard-requires jq. This will fail installs on minimal systems (e.g., fresh Debian/Alpine containers) that previously worked fine. Either:
- Auto-install `jq` (like Docker is auto-installed in phase 05), or
- Keep it optional with graceful degradation when absent
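A graceful-degradation sketch of the second option (the helper name and fallback pattern below are illustrative assumptions, not the actual preflight code): use `jq` when present, and fall back to a minimal `sed` extraction for the flat keys the installer needs.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the actual 01-preflight.sh code: read a
# top-level numeric key from flat JSON, via jq when available and a
# sed fallback when it is not.
json_get_number() {
  local json="$1" key="$2"
  if command -v jq >/dev/null 2>&1; then
    jq -r --arg k "$key" '.[$k]' <<<"$json"
  else
    # Fallback only handles flat, unnested JSON -- good enough for a
    # simple capability profile.
    sed -n "s/.*\"$key\"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p" <<<"$json"
  fi
}

gpu_count=$(json_get_number '{"gpu_count": 4, "has_nvlink": 1}' gpu_count)
echo "gpu_count=$gpu_count"
```

Either branch yields the same value, so minimal systems without `jq` keep working while richer systems get full JSON handling.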
2. No CI checks have run
This branch has zero CI results. Please push a commit or re-trigger CI so we can see if it passes the test matrix.
3. Docker Compose GPU reservation conflict
docker-compose.multigpu.yml sets both NVIDIA_VISIBLE_DEVICES env var AND deploy.resources.reservations.devices without device_ids. The reservation block will reserve ALL GPUs while the env var tries to limit visibility. These two mechanisms conflict — pick one or wire device_ids dynamically.
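One way to wire `device_ids` dynamically, sketched below (variable names and the single-service shape are illustrative assumptions, not the PR's actual code): generate the reservation block from the same GPU list that would otherwise feed `NVIDIA_VISIBLE_DEVICES`, so the two mechanisms cannot disagree.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: emit a compose overlay whose device_ids match the
# assigned GPU list exactly, instead of reserving ALL GPUs.
GPU_IDS="GPU-aaaa,GPU-bbbb"   # illustrative UUIDs

# Turn "a,b" into a compose-style list: ["a", "b"]
ids_yaml=$(printf '%s' "$GPU_IDS" | sed 's/,/", "/g; s/^/["/; s/$/"]/')

overlay=$(cat <<EOF
services:
  llama-server:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: $ids_yaml
              capabilities: [gpu]
EOF
)
echo "$overlay"
```

With `device_ids` pinned this way, the `NVIDIA_VISIBLE_DEVICES` env var on the same service becomes redundant and can be dropped, which removes the conflict.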
4. Minor: duplicate comment line
constants.sh has INSTALL_START_EPOCH listed twice in the "Provides" header comment.
What's good
- The topology detection with `nvidia-smi topo -m` fallback is well-handled
- The `assign_gpus.py` algorithm is correct and the O(2^N) subset enumeration is fine for realistic GPU counts
- Single-GPU path is preserved (gated on `GPU_COUNT > 1`)
- Graceful degradation when `nvidia-smi` is absent
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from `a49941c` to `7ddc533`
@Lightheartdevs Thanks for the thorough review! I adjusted the PR.

1. jq auto-install: Good catch. I've added auto-install logic for jq.
4. INSTALL_START_EPOCH duplication: Fixed!

I appreciate the detailed feedback!
Review Update: Rebase Required Before Merge

Hey @y-coffee-dev, great work addressing the previous review items. The code itself is solid and we want to get this merged. However, we found a critical issue that needs attention first.

🚨 Silent merge bug:
- Enhanced multi-GPU tier assignment based on topology
- Implemented robust GPU topology detection for NVIDIA
- Implemented GPU link ranking from fastest to slowest, for optimal strategy selection in future phases
- Implemented gathering of detailed per-GPU information
- Data structures for GPU information storage
- Robust and comprehensive test suite for NVIDIA topology detection
- Multi-GPU strategy selection algorithm
- Careful handling of edge cases and subtle bugs in strategy selection
- Robust test suite for multi-GPU strategy selection algorithm
- GPU assignment and parallelization strategy selection algorithm, clustering GPUs by topology links to find the optimal setup
- Multi-GPU configuration TUI, docker compose overlays for multi-GPU setups
- Adjust env schema validation
- Fixed inconsistencies in GPU count, JSON escaping issues, etc.
- Fix issue with writing multigpu overlay
- Fix resolve-compose-stack.sh multi-GPU overlay
- Fix GPU device id
- Refactors + less convoluted docker compose setup
- N_GPU_LAYERS validation
- Fix multi-gpu overlay
Code Review (2026-03-24)

Overall: Well-architected PR. The topology detection, strategy selection, and compose overlay pattern are all clean and consistent with existing conventions. Good test coverage (7 fixtures). Not merging yet due to conflicts and testing requirements, but this is on track.

Merge Conflicts (5 files)

This PR was branched before the Lemonade integration (19 PRs merged 2026-03-24). The following files will conflict:
Action needed: Rebase onto current.

Revision Requests
Testing Requirements
What's Good
Status: Defer merge until conflicts resolved and multi-GPU hardware testing available. Happy to help with conflict resolution when ready.
Force-pushed from `7ddc533` to `ffcc3a0`
Hey @Lightheartdevs! Thanks for the thorough review across installer files, really appreciate the care that went into this. Good catch on the potential silent drop; I rebased and made sure all commits came through.

Revision requests

- INTERACTIVE guard: confirmed, and added an explicit early-return guard at the top.
- jq: After checking deeper, it turns out jq was already a hard dependency before this PR.
- Temp file cleanup: added.

Hardware testing

I tested this end-to-end on several multi-GPU machines: install through compose up, services running on their assigned GPUs, and confirmed the expected assignments.

Thanks again for the detailed feedback!
Code Review (2026-03-25)

Solid multi-GPU feature. Architecture is clean, test coverage is thorough, single-GPU and AMD paths are unaffected. Ready to merge with minor notes:

Non-blocking observations
What's good
Approved. ✅
feat: NVIDIA Multi-GPU Detection, Topology-Aware Assignment & Parallelism
Summary
Adds end-to-end multi-GPU support for NVIDIA systems. The installer now automatically detects the multi-GPU topology, assigns GPUs to services based on interconnect quality and VRAM capacity, and configures services for multi-GPU use, all without manual intervention. A custom assignment TUI is also available for advanced users.
Architecture
Topology Detection (`nvidia-topo.sh`)

Parses the `nvidia-smi topo -m` matrix to extract GPU-to-GPU link types and assigns numerical ranks.

GPU Assignment Algorithm (`assign_gpus.py`)

Four-phase pipeline:
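As a rough sketch of the subset search at the heart of the pipeline (the rank matrix, tie-breaking rule, and names below are illustrative assumptions, not the actual `assign_gpus.py` code): enumerate every GPU subset of the required size and keep the one whose weakest internal link is strongest.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the O(2^N) subset search; the real
# assign_gpus.py phases and scoring are more involved.
N=4   # GPUs present
K=2   # GPUs needed by the service
# Flattened 4x4 link-rank matrix (higher = faster): NVLink (rank 3)
# between GPUs 0-1 and 2-3, plain PCIe (rank 1) elsewhere.
ranks=(0 3 1 1  3 0 1 1  1 1 0 3  1 1 3 0)

best_mask=0 best_min=-1
for ((mask = 1; mask < (1 << N); mask++)); do
  # Collect the members of this subset.
  members=()
  for ((i = 0; i < N; i++)); do
    if (( mask & (1 << i) )); then members+=("$i"); fi
  done
  (( ${#members[@]} == K )) || continue
  # Weakest pairwise link inside this subset.
  min=99
  for i in "${members[@]}"; do
    for j in "${members[@]}"; do
      if (( i < j )); then
        r=${ranks[i * N + j]}
        (( r < min )) && min=$r
      fi
    done
  done
  # Keep the subset with the strongest weakest link.
  if (( min > best_min )); then best_min=$min best_mask=$mask; fi
done

echo "best_mask=$best_mask best_min_rank=$best_min"
```

For 8 GPUs this is only 256 subsets, which is why the exhaustive search is fine for realistic GPU counts.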
Compose Layering

When `GPU_COUNT > 1`, the stack adds:

- `docker-compose.multigpu.yml`: llama-server GPU pinning + split mode
- `extensions/services/*/compose.multigpu.yaml`: per-service GPU pinning

Interactive TUI
Multi-GPU systems get a configuration prompt that feeds `assign_gpus.py` the detected topology. Non-interactive installs default to automatic assignment.
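The `GPU_COUNT > 1` gate described under Compose Layering can be sketched roughly like this (flag assembly is illustrative; the real `compose-select.sh` / `resolve-compose-stack.sh` logic differs in detail):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the multi-GPU overlay selection gate.
GPU_COUNT=2
compose_args=(-f docker-compose.yml)

if (( GPU_COUNT > 1 )); then
  compose_args+=(-f docker-compose.multigpu.yml)
  # Per-service overlays contributed by extensions, if any exist.
  for overlay in extensions/services/*/compose.multigpu.yaml; do
    [ -e "$overlay" ] && compose_args+=(-f "$overlay")
  done
fi

cmd="docker compose ${compose_args[*]} up -d"
echo "$cmd"
```

With `GPU_COUNT=1` the extra `-f` flags never appear, which is how the single-GPU path stays untouched.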
Test coverage
Automated tests
- `tests/test-nvidia-topo.sh`: tests topology matrix parsing against 7 fixture files covering 1-GPU through 8-GPU configurations, NVLink/PCIe/NUMA topologies, and edge cases like NIC rows in the matrix
- `tests/test-assign-gpus.py`: comprehensive pytest suite covering:

Manual hardware testing
Thoroughly tested on several multi-GPU machines with various configurations including (non-exhaustive):
All tests confirmed correct topology detection, appropriate strategy selection, and proper compose overlays.
What changed
New files
- `installers/lib/nvidia-topo.sh`: parses the `nvidia-smi topo -m` matrix into structured JSON with link types, ranks, and labels
- `scripts/assign_gpus.py`
- `docker-compose.multigpu.yml`: sets `NVIDIA_VISIBLE_DEVICES`, `LLAMA_ARG_SPLIT_MODE`, and `LLAMA_ARG_TENSOR_SPLIT`
- `extensions/services/comfyui/compose.multigpu.yaml`
- `extensions/services/whisper/compose.multigpu.yaml`
- `extensions/services/embeddings/compose.multigpu.yaml`
- `tests/test-nvidia-topo.sh`
- `tests/test-assign-gpus.py`
- `tests/fixtures/topology_json/*.json` (8 files)
- `tests/fixtures/topology_matrix/*.txt` (7 files): `nvidia-smi topo -m` output fixtures for shell-level testing
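As a rough illustration of what the matrix parsing in `nvidia-topo.sh` does (the rank mapping below is an assumption, not the PR's actual ranking), an awk pass over a simplified two-GPU NVLink matrix might look like:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: parse a (simplified) `nvidia-smi topo -m` matrix
# and rank each GPU-to-GPU link. Rank values are illustrative.
matrix='      GPU0  GPU1  CPU Affinity
GPU0   X    NV4   0-31
GPU1  NV4    X    0-31'

links=$(echo "$matrix" | awk '
NR == 1 { for (c = 1; c <= NF; c++) col[c + 1] = $c; next }
/^GPU/ {
  for (c = 2; c <= NF; c++) {
    link = $c
    if (col[c] !~ /^GPU/ || link == "X") continue   # self / non-GPU cols
    # NV* (NVLink) fastest, PIX/PXB/PHB (PCIe) middling, SYS/NODE slowest
    rank = (link ~ /^NV/) ? 3 : (link ~ /^(PIX|PXB|PHB)$/) ? 2 : 1
    print $1, col[c], link, "rank=" rank
  }
}')
echo "$links"
```

The real script emits structured JSON rather than plain lines, but the column-header bookkeeping and link-to-rank mapping follow the same shape.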
- `installers/phases/01-preflight.sh`: adds `jq` and `python3` to preflight dependency checks (required by topology detection and assignment)
- `installers/phases/02-detection.sh`: `detect_nvidia_topo()` populates `GPU_TOPOLOGY_JSON`, `GPU_HAS_NVLINK`, `GPU_TOTAL_VRAM`, `LLM_MODEL_SIZE_MB`
- `installers/phases/03-features.sh`
- `installers/phases/04-requirements.sh`
- `installers/phases/06-directories.sh`: writes `GPU_ASSIGNMENT_JSON` and per-service GPU UUIDs to `.env`
- `installers/lib/constants.sh`
- `installers/lib/tier-map.sh`
- `installers/lib/compose-select.sh`: adds `docker-compose.multigpu.yml` when `GPU_COUNT > 1`
- `scripts/resolve-compose-stack.sh`: adds `--gpu-count` flag; discovers and merges `compose.multigpu.yaml` from extensions
- `scripts/detect-hardware.sh`: calls `nvidia-topo.sh` for topology detection
- `scripts/build-capability-profile.sh`: reads `gpu.count` from capability profile instead of hardcoding `1`
- `.env.schema.json`: adds `GPU_ASSIGNMENT_JSON_B64`, `LLAMA_SERVER_GPU_UUIDS`, `LLAMA_ARG_SPLIT_MODE`, `LLAMA_ARG_TENSOR_SPLIT`, `EMBEDDINGS_GPU_UUID`, `COMFYUI_GPU_UUID`, `WHISPER_GPU_UUID`, `N_GPU_LAYERS`